Intro to R: A hands-on tutorial

Day 0: Intro to statistical programming

Sarah Strochak, Kyle Ueyama, Aaron R. Williams

R Lunch Lab

Statistical Programming

Motivation: why statistical programming?

  1. Clearly answer questions
  2. Clearly communicate the answer to questions
  3. Document the steps to answering a question

Example 1

What is 2 + 2?

## [1] 4
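
A minimal sketch of how this question could be answered in the R console. The object name `answer` is an illustrative assumption, not part of the original slides.

```r
# Evaluate the expression directly; R prints the result to the console.
2 + 2

# Store the result in a named object so later steps can reuse it.
answer <- 2 + 2
answer
```

Typing an object's name on its own line prints its value, which is how the `## [1] 4` style of output is produced.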

Example 2

What is the median price of diamonds with carat > 1 and a Good cut?

## # A tibble: 1 x 1
##   `median(price)`
##             <int>
## 1            6412
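
The column name `median(price)` in the output above suggests a dplyr pipeline roughly like the following. This is a sketch under assumptions: it assumes the `diamonds` data set that ships with ggplot2 and that the dplyr and ggplot2 packages are installed. The object name `median_price` is hypothetical.

```r
library(dplyr)    # filter(), summarize(), and the %>% pipe
library(ggplot2)  # provides the diamonds data set

# Keep diamonds heavier than 1 carat with a "Good" cut,
# then collapse the remaining rows to a single median price.
median_price <- diamonds %>%
  filter(carat > 1, cut == "Good") %>%
  summarize(median(price))

median_price
```

Each line of the pipeline maps to one clause of the question, which is what makes the script a record of how the answer was reached.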

Example 3

How could increasing the retirement age affect the poverty rates of Hispanic women ages 62 and older?

Via die-seite-des-dr-caligari

Principles

  1. Accuracy
  2. Computational reproducibility
  3. Human interpretability
  4. Portability
  5. Accessibility
  6. Efficiency

1) Accuracy

Deliberate steps should be taken to minimize the chance of making an error and maximize the chance of catching errors when errors inevitably occur.
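
One deliberate step of this kind is to assert assumptions about the data before using it. The sketch below uses base R's `stopifnot()`; the `ages` vector and its plausible range are made-up illustrations.

```r
# Hypothetical input: ages that should all be present and plausible.
ages <- c(34, 62, 71, 45)

# Stop with an error as soon as an assumption about the data is violated.
stopifnot(
  !anyNA(ages),     # no missing values
  all(ages >= 0),   # no negative ages
  all(ages <= 120)  # no implausibly large ages
)

# Only reached if every check above passed.
mean_age <- mean(ages)
mean_age
```

If a later data delivery contains a missing or miscoded age, the script fails at the check instead of silently producing a wrong mean.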

2) Computational reproducibility

Computational reproducibility should be embraced to improve accuracy, promote transparency, and prove the quality of analytical work.

3) Human interpretability

Code should be written so humans can easily understand what’s happening—even if it occasionally sacrifices machine performance.

4) Portability

Analyses should be designed so strangers can understand each and every step without additional instruction or inquiry from the original analyst.

5) Accessibility

Research and data are non-rival and non-exclusive. They are public goods that should be widely and easily shared. Decisions about tools, methods, data, and language during the research process should be made in ways that promote the ability of anyone and everyone to access an analysis.

6) Efficiency

Analysts should seek to make all parts of the research process more efficient by communicating clearly, adopting best practices, and managing computation.

A survey of other programming languages

Stata

  • Common users: economists, Nate Silver
  • Strengths: out-of-the-box econometric tools, simple syntax
  • Limitations: proprietary, one data set at a time, inflexible

Photo by StataCorp LP, CC BY-SA 4.0, Unaltered

SAS

  • Common users: veteran researchers, government
  • Strengths: processes data from disk rather than loading it all into memory, so it can handle data sets larger than memory
  • Limitations: proprietary, expensive, clunky, inflexible, lacks environments, documentation

Matlab

  • Common users: mathematicians, engineers
  • Strengths: matrices, 3D plotting
  • Weaknesses: cost, 3D plotting

SPSS

  • Common users: psychologists
  • Strengths: point-and-click tools
  • Weaknesses: point-and-click tools, limited functionality

Python

  • Users: data scientists, computer scientists
  • Strengths: general purpose programming, extensibility, flexibility
  • Weaknesses: steep learning curve

R

  • Users: statisticians, data scientists, biostatisticians
  • Strengths: extensible, documentation, community
  • Limitations: multiple languages in one

Others

  • Julia
  • Rust
  • JavaScript
  • SQL

What you use matters less than how you use it

R is the best

Comparison

Source is unknown

A brief history of R

S

R is an implementation of the S programming language, which was created at Bell Labs in the 1970s.

S-PLUS is a proprietary implementation of S that was common for years.

R

R is a free, open-source programming language created by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s.

R is mostly written in R, C, and FORTRAN.

CRAN

The Comprehensive R Archive Network was introduced in 1997.

Repository of popular R packages with basic standards and quality control.

tidyverse

Comprehensive set of tools for data science

Core: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats

tidyverse

Free text: R for Data Science by Hadley Wickham and Garrett Grolemund

RStudio

IDE and for-profit company that funded and professionalized R development

Fundamental concepts

Text editor/IDE

  • R == free, open source programming language
  • RStudio == for-profit company and Integrated Development Environment (IDE)

The R console

Computational Reproducibility

  • Replication: the recreation of findings across repeated studies. It is a cornerstone of science.
  • Reproducibility: the ability to access data, source code, tools, and documentation and recreate all calculations, visualizations, and artifacts of an analysis
  • Computational reproducibility should be the minimum standard for computational social sciences and statistical programming
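
Two small habits that support reproducibility can be sketched in base R: fixing the random seed and recording the software environment. The seed value and object names below are arbitrary assumptions.

```r
# Fix the random number generator so "random" draws repeat exactly.
set.seed(20200101)
first_run <- rnorm(5)

# Resetting the same seed reproduces the same draws.
set.seed(20200101)
second_run <- rnorm(5)

identical(first_run, second_run)  # TRUE

# Record the R version and attached packages alongside the results.
sessionInfo()
```

Without the seed, each run of the script would produce different numbers, and nobody could recreate the exact results.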

Script

  • A plain text document that contains code and comments
  • Map to the answer
  • .R and .Rmd

Comments

  • Clear code avoids the need for describing “what”
  • Comments should focus on “why”
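
A short sketch of the difference between a "what" comment and a "why" comment. The `incomes` data and the `-1` nonresponse code are invented for illustration.

```r
# Hypothetical household incomes in dollars; -1 marks a survey nonresponse.
incomes <- c(42000, -1, 58000, 61000, -1)

# A "what" comment only restates the code: "replace -1 with NA".
# A "why" comment explains the decision:
# The survey codes nonresponse as -1. Treat those values as missing
# so they do not pull the mean toward zero.
incomes[incomes == -1] <- NA

mean_income <- mean(incomes, na.rm = TRUE)
mean_income
```

The code already shows what happens; only the comment can tell a reader where the `-1` convention comes from.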

Coding style

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” ~ Hadley Wickham

  • CamelCase
  • camelCase
  • snake_case

tidyverse style guide
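
As a rough illustration of why style matters, the two snippets below compute the same value. The second follows conventions recommended in the tidyverse style guide (snake_case names, spaces around operators and after commas). The object names and prices are illustrative assumptions.

```r
# Harder to read: no spacing and a name in a different case style.
MedianPrice<-median(c(326,554,2757))

# Easier to read: snake_case names and consistent spacing.
diamond_prices <- c(326, 554, 2757)
median_price <- median(diamond_prices)

median_price  # 554
```

R treats both versions identically, so the choice only affects the humans who read the code.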

R Packages

Collections of R, C, C++, and FORTRAN code that expand the functionality of R.
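
A minimal sketch of the usual package workflow: install once, load each session, or call a single function with `package::function()`. The example loads `stats` because it ships with R; the install line is commented out because it needs a network connection.

```r
# Install once per machine (requires a network connection).
# install.packages("dplyr")

# Load a package once per R session.
library(stats)

# Alternatively, call one function without attaching the whole package.
middle_value <- stats::median(c(1, 3, 10))
middle_value  # 3
```

The `::` form also makes it obvious to a reader which package a function comes from.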

Tests
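
A test checks that code returns what is expected for known inputs. The sketch below uses base R's `stopifnot()`; dedicated packages such as testthat are a common alternative. The helper `to_percent()` is hypothetical.

```r
# Hypothetical helper: convert a proportion to a percentage.
to_percent <- function(x) {
  x * 100
}

# Each check raises an error if the function stops behaving as expected.
stopifnot(
  to_percent(0.5) == 50,
  to_percent(0) == 0,
  is.na(to_percent(NA))  # missing input should stay missing
)
```

Rerunning the checks after every change catches mistakes before they reach a result.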

Data structures

  • scalars (do not exist in R)
  • vectors
  • matrices
  • data frames, multidimensional arrays
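
The structures listed above can be sketched in base R as follows. All object names and values are illustrative.

```r
# A "scalar" is really a vector of length one.
x <- 4
is.vector(x)  # TRUE
length(x)     # 1

# Vector: an ordered set of values of a single type.
prices <- c(326, 554, 2757)

# Matrix: a two-dimensional structure of a single type.
m <- matrix(1:6, nrow = 2)
dim(m)        # 2 3

# Array: the same idea with more than two dimensions.
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: columns of equal length that may differ in type.
df <- data.frame(cut = c("Good", "Ideal"), price = c(326, 554))
```

Data frames are the structure most analyses start from, since each column can hold a different type of variable.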

Data types

character

numeric

logical

factor
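
A quick way to see these four types is to ask R for the class of a few values. The factor below uses made-up cut categories.

```r
class("Good")  # "character"
class(3.14)    # "numeric"
class(TRUE)    # "logical"

# A factor stores categories as integer codes plus a set of labels.
cut_quality <- factor(c("Good", "Ideal", "Good"), levels = c("Good", "Ideal"))
class(cut_quality)       # "factor"
levels(cut_quality)      # "Good" "Ideal"
as.integer(cut_quality)  # 1 2 1
```

Factors are useful when a variable has a fixed set of categories, especially when the order of those categories matters for tables and plots.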

functions/macros
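
In R, reusable logic is usually wrapped in a function, which plays a role similar to macros in some other statistical packages. A minimal sketch with a hypothetical `percent_change()` function:

```r
# Hypothetical function: percent change from an old value to a new one.
percent_change <- function(old, new) {
  (new - old) / old * 100
}

# Works on single values...
percent_change(old = 50, new = 75)     # 50

# ...and element by element on vectors.
percent_change(c(10, 20), c(15, 10))   # 50 -50
```

Writing the calculation once and calling it by name reduces copy-and-paste errors.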

Filtering
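
Filtering keeps only the rows that meet a condition. The sketch below uses base R so it runs without extra packages; `dplyr::filter()` is the tidyverse equivalent. The `gems` data frame is invented for illustration.

```r
# Hypothetical data, loosely modeled on the diamonds example.
gems <- data.frame(
  carat = c(0.5, 1.2, 1.5, 0.9),
  cut   = c("Good", "Good", "Ideal", "Good"),
  price = c(1500, 6000, 9000, 3200)
)

# Keep rows where both conditions are TRUE.
big_good <- gems[gems$carat > 1 & gems$cut == "Good", ]

# subset() expresses the same filter with less repetition.
big_good_2 <- subset(gems, carat > 1 & cut == "Good")

big_good
```

Only the second row satisfies both conditions, so both approaches return a one-row data frame.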

Summarization
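
Summarization collapses many rows into a few numbers, either overall or by group. A base R sketch with made-up data follows; `dplyr::summarize()` with `group_by()` is the tidyverse equivalent.

```r
# Hypothetical data: five stones with a cut and a price.
gems <- data.frame(
  cut   = c("Good", "Good", "Ideal", "Ideal", "Ideal"),
  price = c(1500, 6000, 9000, 3200, 4000)
)

# One summary for the whole column.
overall_median <- median(gems$price)
overall_median  # 4000

# One summary per group: median price for each cut.
median_by_cut <- aggregate(price ~ cut, data = gems, FUN = median)
median_by_cut
```

The grouped result has one row per cut, which is the shape most tables and charts need.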

Organizing an analysis

Ways to learn a programming language

1: use it, use it again, use it some more.

Software check